HIERARCHICAL DICTIONARY LEARNING FOR INVARIANT CLASSIFICATION


Leah  Bar  and  Guillermo  Sapiro 

Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, USA


ABSTRACT 

Sparse representation theory has been increasingly used in the fields
of signal processing and machine learning. The standard sparse
models are not invariant to spatial transformations such as image
rotations, and the representation is very sensitive even to small
distortions of this kind. Most studies addressing this problem proposed
algorithms which either use transformed data as part of the training
set, or are invariant or robust only under minor transformations. In
this paper we suggest a framework which extracts invariant sparse
features under significant rotations and scalings. The algorithm is
based on a hierarchical architecture of dictionary learning for sparse
coding in a cortical (log-polar) space. The proposed model is tested
in supervised classification applications and proves robust on
transformed data sets.

Index  Terms —  Sparse  models,  dictionary  learning,  hierarchy, 
log-polar,  invariance,  classification 

1.  INTRODUCTION 

Sparse signal models over learned dictionaries have proved very
powerful in recent years in image processing, speech processing, and
machine learning. Sparse representations capture inherent structures
of the signal and are relatively robust to (additive) noise. In their
standard form, however, these compact representations are not invariant
under transformations such as translation, scaling, and rotation.
Kavukcuoglu et al. [1] learn locally invariant feature descriptors by
pooling the sparse coefficients across overlapping windows, yet their
algorithm is designed for relatively small distortions. Shift-invariant
dictionary learning was investigated in [2, 3]. The idea in both papers
is to train the dictionaries on many possible shifted versions of the
signal (see also [4] for related ideas extending the popular SIFT
descriptor to the affine case). This approach can be computationally
expensive, and combining several transformations may make the
implementation impractical. Huang et al. [5] simultaneously recover
the sparse representation of the target image and the geometrical
(translation/affine) transformation between the target and model
images. The transformations are approximated by a first-order Taylor
expansion and are therefore again limited to being relatively small.

Invariance can be approached by biologically inspired architectures.
Typically, the extraction of local features is followed by spatial
pooling, which is classically modeled as a hierarchy of increasingly
complex structures. These ideas have led to extensive research and
algorithms. Serre et al. [6], for example, suggested a scale- and
position-tolerant feature detector based on alternating between
template matching and a maximum pooling operator. Ranzato et al. [7]
suggested a hierarchical feature extraction algorithm which is
invariant under small shifts and distortions.

In this paper we introduce a framework for dictionary learning
and sparse feature extraction which is invariant under significant
ordinary rotation and scaling transformations. In the proposed method,
we integrate the ideas of sparse representation theory and hierarchical
structures. By using a special conformal (log-polar) mapping of the
data, rotated and scaled patterns are converted into shifted patterns
in the new space, on which we operate for learning the dictionary and
adding hierarchy. Our approach is closely related to [8], where
log-polar images were represented by invariant wavelet packets. Yet,
in our work we incorporate a hierarchical architecture that is designed
to also eliminate the effects of translations in this space. As we show
later, a hierarchical approach performs better than a one-layered one.
Moreover, we learn dictionaries instead of using pre-defined wavelets,
following recent results in the literature showing that such learned
dictionaries often outperform off-the-shelf ones.

The suggested approach is general and suited for data of very
different nature. The method was tested in two applications. In the
first, we classified handwritten digits using only aligned training
patterns, while the test patterns were transformed by significant
rotations and scalings. In the second, we classified texture images
with large variability of scaling and rotation. Promising results
support the stability and robustness of the suggested approach.

2.  BACKGROUND  AND  NOTATIONS 

In sparse modeling, a signal $x \in \mathbb{R}^n$ is represented
as a linear combination of basis column vectors $d_j \in \mathbb{R}^n$
(atoms), which form a dictionary $D \in \mathbb{R}^{n \times K}$, such
that $x = D\alpha$. The vector $\alpha \in \mathbb{R}^K$ is assumed to
be sparse, meaning that the number of its non-zero elements is much
smaller than $K$. A dictionary can be overcomplete, $K > n$, as is
often used in restoration algorithms. Most classification algorithms,
on the other hand, use undercomplete ($K < n$) dictionaries. Given a
dictionary $D$ and a signal $x$, the $\ell_1$ sparse coding problem is
given by

$$\hat{\alpha} = \arg\min_{\alpha} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1, \tag{1}$$

where $\lambda \in \mathbb{R}$ is a regularization constant. This
formulation and its variants are referred to as basis pursuit or
Lasso [9]. The optimization in this work was carried out using the
Lars [10] algorithm, which we denote as
$\alpha \leftarrow \mathrm{Lars}(x, D)$.
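As a minimal illustration of problem (1), the sketch below solves the same objective with iterative soft-thresholding (ISTA). This is a stand-in chosen for brevity and is not the Lars algorithm used in this work; the function names, the iteration count, and the toy data are assumptions made only for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, D, a, lam):
    """Value of ||x - D a||_2^2 + lam * ||a||_1, as in Eq. (1)."""
    return np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a))

def sparse_code_ista(x, D, lam, n_iter=500):
    """Approximately minimize Eq. (1) over a by ISTA (not Lars)."""
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - x)   # gradient of ||x - D a||_2^2
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Toy usage: a signal built from three atoms of a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
x = D @ a_true
a_hat = sparse_code_ista(x, D, lam=0.1)
print(objective(x, D, a_hat, 0.1))
```

With a step size of 1/L the objective does not increase from one iteration to the next, which makes the sketch easy to sanity-check even though it converges more slowly than a homotopy method such as Lars.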

Consider a set of $m$ signals
$X = [x_1, \dots, x_m] \in \mathbb{R}^{n \times m}$. The dictionary $D$
and the coefficient set
$\alpha = [\alpha_1, \dots, \alpha_m] \in \mathbb{R}^{K \times m}$ are
given by

$$(D, \alpha) = \arg\min_{D, \alpha} \sum_{i=1}^{m} \left( \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \right). \tag{2}$$

The optimization is performed by alternating minimization with respect
to $\alpha$ and $D$. We denote the dictionary learning process as
$(D, \alpha) \leftarrow \mathrm{TrainDictionary}(X)$. Detailed
descriptions of both algorithms can be found, for example, in [11].
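The alternating scheme behind (2) can be sketched as follows. This is a generic illustration under stated assumptions, not the specific procedure of [11]: the sparse coding step uses ISTA, the dictionary step is a plain least-squares update followed by atom normalization, and the name `train_dictionary` and all parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(X, D, lam, n_iter=200):
    """ISTA on all columns of X at once (D fixed)."""
    # Small constant guards against a degenerate all-zero dictionary.
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = soft_threshold(A - 2.0 * D.T @ (D @ A - X) / L, lam / L)
    return A

def train_dictionary(X, K, lam, n_outer=10, seed=0):
    """Alternate between sparse coding (D fixed) and a dictionary update (A fixed)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize atoms with K distinct, normalized training signals.
    D = X[:, rng.choice(X.shape[1], size=K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0) + 1e-12
    for _ in range(n_outer):
        A = sparse_code(X, D, lam)
        D = X @ np.linalg.pinv(A)                  # least-squares fit of X ~ D A
        D /= np.linalg.norm(D, axis=0) + 1e-12     # keep atoms at (most) unit norm
    return D, sparse_code(X, D, lam)

# Toy usage with an undercomplete dictionary (K < n).
rng = np.random.default_rng(1)
X = rng.standard_normal((16, 200))
D, A = train_dictionary(X, K=8, lam=0.5, n_outer=5)
print(D.shape, A.shape)
```

Normalizing the atoms after each update removes the scale ambiguity between $D$ and $\alpha$; without it the $\ell_1$ penalty could be made arbitrarily small by inflating the atoms.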


3.  CLASSIFICATION  VIA  SPARSE  CODING 


Let $X^{\mathrm{train}}$ be a labeled training set, and
$X^{\mathrm{test}}$ the unlabeled testing set. Our goal is to learn a
classifier which is robust under a group of transformations that are
present in $X^{\mathrm{test}}$ but not necessarily in
$X^{\mathrm{train}}$. We begin by presenting a simple classification
algorithm based on a sparse reconstructive model, and continue by
introducing the invariant hierarchy-based approach.

The procedure, which we refer to as STL, follows, for example,
the Self-taught Learning via Sparse Coding algorithm [12] (see also
[13]). The idea is to learn a dictionary from an unlabeled dataset.
The sparse coding coefficients obtained when coding elements of the
labeled dataset then serve as features which are fed into an SVM
classifier.* In our implementation, the dictionary was trained with
the aligned labeled data. New data is then classified with the learned
dictionary and SVM.

Algorithm STL

1. $(D, \alpha) \leftarrow \mathrm{TrainDictionary}(X^{\mathrm{train}})$

2. Learn a classifier $C$ by a linear SVM based on $\alpha$.

3. $\beta \leftarrow \mathrm{Lars}(X^{\mathrm{test}}, D)$

4. Classify the set $\beta$ by $C$.

This algorithm is very effective when the training and testing sets
are aligned. Even though there are state-of-the-art algorithms with
better performance, e.g., [13], they are based on discriminative
dictionary learning models, in the sense of a modified version of
Equation (2). In our approach, on the other hand, we use a simpler
reconstructive model which is based on (2). Extending the framework
presented here to such discriminative models is part of our ongoing
efforts.

4.  HIERARCHICAL  DICTIONARY  LEARNING 

Two main approaches are widely used to deal with invariant features
in frameworks such as the one proposed here. One strategy is to train
the system with as many transformed patterns as possible.
Alternatively, invariant features can be extracted from much smaller
training sets. In the proposed method we follow this second approach,
and the invariant characteristics are implicitly captured. Input images
are first transformed by a conformal mapping such that rotations and/or
scalings are reduced to horizontal and/or vertical translations.
Dictionaries are then trained with this data in a hierarchical fashion:
the outcome associated with grouped sub-blocks from one layer serves
as the input to a new layer of learned dictionaries. Finally,
translation invariance is accomplished by a further special
hierarchical transform.

4.1.  Log-Polar  Mapping 

Images can be represented in different spaces. Fischer [14] originally
suggested that the transformation of the visual field into its neural
representation is approximated by a complex logarithmic mapping
$W = \log(Z)$, where $Z = a + ib$ ($a$ and $b$ are the spatial
coordinates in the image domain) and $W = \xi + i\eta$ are complex
numbers that define the retinal and cortical spaces, respectively. The
mapping is given by $\xi = \log\sqrt{a^2 + b^2}$ and
$\eta = \tan^{-1}(b/a)$. This is the map used to transform the image
data.

* LIBSVM package: http://www.csie.ntu.edu.tw/~cjlin/libsvm/


Fig. 1. Left: Rotations in the retinal space (top) are converted into
cyclic shifts in the cortical space (bottom). Right: Scalings in the
retinal space (top) are converted into shifts along the vertical axis in
the cortical space (bottom).


Clearly, rotations are converted into cyclic translations along the
$\eta$ axis, while scalings are converted into translations along the
$\xi$ axis (Fig. 1).
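The shift property can be checked numerically on a single point. The sketch below is only an illustration: the helper names are hypothetical, and it uses the principal complex logarithm, so $\eta$ covers the full range $(-\pi, \pi]$ rather than the range of $\tan^{-1}(b/a)$.

```python
import cmath
import math

def to_cortical(a, b):
    """Map retinal coordinates (a, b) to cortical (xi, eta) via W = log(Z)."""
    w = cmath.log(complex(a, b))
    return w.real, w.imag            # xi = log|Z|, eta = arg(Z)

def rotate_scale(a, b, theta, s):
    """Rotate (a, b) by theta radians about the origin and scale it by s."""
    z = s * cmath.exp(1j * theta) * complex(a, b)
    return z.real, z.imag

# A point, and the same point rotated by 30 degrees and scaled by 2.
xi0, eta0 = to_cortical(1.0, 1.0)
a1, b1 = rotate_scale(1.0, 1.0, math.radians(30.0), 2.0)
xi1, eta1 = to_cortical(a1, b1)

# Scaling shifts xi by log(s); rotation shifts eta by theta.
print(xi1 - xi0, math.log(2.0))
print(eta1 - eta0, math.radians(30.0))
```

The shift along $\eta$ is cyclic: once the rotated angle passes $\pi$ it wraps around to $-\pi$, which is why rotations appear as cyclic translations in the cortical image.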

4.2.  Rapid  Transform 

We now briefly describe a non-linear hierarchical transformation
which is invariant under cyclic permutation. It will be used later
when we describe the proposed algorithm. The Rapid transform, presented
in the frame below, was suggested by Reitboeck and Brody [15],
and has been widely used in pattern recognition algorithms, e.g., [16].
Let $U$ be a vector of $M = 2^n$ elements. Then the output vector
$V \leftarrow \mathrm{Rapid}(U)$ is invariant under cyclic translations
of $U$ (the proof can be found in [15]).

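$V \leftarrow \mathrm{Rapid}(U)$

1. Let $U(i)$, $i = 1, \dots, M$, $M = 2^n$, be the elements of a
   vector; set $V^0 = U$.

2. for $s = 1$ to $n$

3.    $V^s(2i-1) = V^{s-1}(i) + V^{s-1}(i + M/2)$

4.    $V^s(2i) = |V^{s-1}(i) - V^{s-1}(i + M/2)|$, for $i = 1, \dots, M/2$.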
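The frame above translates almost line by line into code. The following is a minimal sketch in plain Python, with 0-based indexing replacing the 1-based indexing of the frame; the function name `rapid` is chosen only for this example.

```python
def rapid(u):
    """Rapid transform of a sequence whose length M is a power of two."""
    m = len(u)
    if m == 0 or m & (m - 1):
        raise ValueError("length must be a power of two")
    v = list(u)
    half = m // 2
    for _ in range(m.bit_length() - 1):               # n = log2(M) stages
        nxt = [0] * m
        for i in range(half):
            nxt[2 * i] = v[i] + v[i + half]           # V^s(2i-1) in 1-based indexing
            nxt[2 * i + 1] = abs(v[i] - v[i + half])  # V^s(2i) in 1-based indexing
        v = nxt
    return v

u = [1, 2, 3, 4]
print(rapid(u))                   # [10, 2, 4, 0]
print(rapid(u[1:] + u[:1]))       # same output for the cyclically shifted input
```

Each stage doubles a cyclic shift of its input (a shift by $k$ becomes a shift by $2k$), because the sum and the absolute difference are symmetric in their two arguments. After $n$ stages the accumulated shift is a multiple of $M$, so it vanishes.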


4.3.  Hierarchical  Invariant  Algorithm  (HIA) 

Based on the previous sections, we now describe the proposed
algorithm. Let $I^{\mathrm{train}}[k]$, $k = 1, \dots, N^{\mathrm{train}}$,
be a set of training images, together with their corresponding
log-polar mappings. The dimension of the original image is not
necessarily identical to the dimension of the log-polar one, since the
angular/radial resolution in the cortical space may be controlled. In
the case of shape images (like digits), the origin of the polar
coordinate system is determined by the center of mass of the shape;
otherwise the origin is the center of the image. Every log-polar image
is divided into $H_p \times W_p$ overlapping patches
$x_i \in \mathbb{R}^{\sqrt{n} \times \sqrt{n}}$ (shown in Fig. 2). Let
us now concatenate all the patches from all training images into the
training set $X^{\mathrm{train}}$.

The first-layer dictionary $D_1$ of size $K_1$ is now calculated based
on the training set $X^{\mathrm{train}}$. The dictionary learning
procedure also yields the training coefficient set
$\alpha_1 \in \mathbb{R}^{K_1 \times H_p W_p N^{\mathrm{train}}}$
corresponding to the sparse code. We are now ready for the second layer
of the hierarchy. Motivated by capturing the most representative
structures of the data, each patch is now replaced by its most
prominent atom $d_i$, e.g., the atom which has the maximum coefficient
$\alpha$.

Let us now group several such atoms into a unit $y_{s,r}$, which forms
a sub-column. A full column accommodates $H_c$ sub-columns, and
there are a total of $W_p \times H_c$ overlapping sub-columns per image
(shadowed blocks in Fig. 2). Once again, we concatenate all the
sub-columns from the training images into a second-layer training set
$[\dots, y_{s,r}[k], \dots]$, $s = 1, \dots, H_c$,
$r = 1, \dots, W_p$, $k = 1, \dots, N^{\mathrm{train}}$. The dictionary
$D_2$ of size $K_2$ and the coefficient set $\alpha_2$ are now
calculated based on this set.

In the next two steps we obtain the desired invariance properties.
From now on, we process every image $k$ separately. Let
$\alpha_2^{s,r}$ be the coefficient vector associated with sub-column
$y_{s,r}$.

Fig. 2. Hierarchical structure of blocks. The white
$\sqrt{n} \times \sqrt{n}$ patches are used in the first layer of the
hierarchy, while the shadowed sub-columns are used in the second one.


5. EXPERIMENTAL RESULTS


The proposed algorithms were tested with two different databases:
handwritten digits and textures from multiple viewpoints. In both
algorithms we used undercomplete dictionaries, which are known to
be effective in classification tasks. For fair comparison, dictionary
sizes were manually optimized. All the experiments reported in
this section share the same dictionary sizes. For the STL algorithm,
$K = 64$. For both hierarchical levels in the suggested algorithm, the
dictionary sizes were $K_1 = 256$ and $K_2 = 256$.

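Algorithm HIA

1. $(D_1, \alpha_1) \leftarrow \mathrm{TrainDictionary}(X^{\mathrm{train}})$

2. $x_i \leftarrow d_{\lambda_i}$, where $\lambda_i$ is the index of the
   maximum coefficient of patch $x_i$ in $\alpha_1$.

3. Group atoms to sub-columns $y_{s,r}$.

4. Second-layer training set:
   $[y_{1,1}[1], \dots, y_{H_c,W_p}[N^{\mathrm{train}}]]$

5. $(D_2, \alpha_2) \leftarrow \mathrm{TrainDictionary}$(second-layer training set)

6. For each image $k$

7.    Sum over columns: $\alpha_2^r = \sum_s \alpha_2^{s,r}$.

8.    $\beta^r \leftarrow \mathrm{Rapid}(\alpha_2^r)$.

9. end

10. Learn a classifier $C$ based on the (non-grouped) $\beta^r$, or on
    the grouped set per image $\{\beta^r\}$.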


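[Diagram: the coefficient vectors $\alpha_2^{s,r}$, $s = 1, \dots, H_c$, $r = 1, \dots, W_p$, arranged on a grid, with the column sums $\alpha_2^1, \alpha_2^2, \dots, \alpha_2^{W_p}$ shown as a shadowed row.]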

Scale invariance is accomplished by summing the coefficients over
a column, such that $\alpha_2^r = \sum_{s=1}^{H_c} \alpha_2^{s,r}$
(shadowed row). Clearly, the sum of the coefficients is invariant
under their permutations, and the vertical translations (due to
scalings) are canceled.

Every image $k$ is now represented by $W_p$ vectors. These arrays
are now fed into the Rapid transform, where every element $U(r)$ is
represented by $\alpha_2^r$. The outcome of the Rapid transform is
denoted by $\beta^r$. As was explained in Section 4.2, the transformed
vector is invariant under cyclic translations, and the coefficients
$\beta^r$ are therefore rotation invariant.

The last step is learning the SVM classifier $C$. One option is to
train on all $N^{\mathrm{train}} W_p$ vectors
$\beta^r \in \mathbb{R}^{K_2}$ individually. The other option is to
group all the coefficients associated with image $k$, meaning that we
train on $N^{\mathrm{train}}$ vectors
$[(\beta^1)^T, \dots, (\beta^{W_p})^T] \in \mathbb{R}^{K_2 W_p}$. The
whole learning algorithm is summarized in the HIA algorithm frame.

Given a new testing data set, invariant features $\beta^r$ are
calculated by the above procedure using the learned $D_1$ and $D_2$.
Classification is then based on the grouped or non-grouped $\beta^r$
and the SVM.

The IA algorithm (see frame below) was designed to evaluate the
significance of the hierarchical approach. The algorithm is similar to
HIA except that the second dictionary learning stage is omitted.
Experimental results support the superiority of the hierarchical model.


Algorithm IA: The same as HIA. Skip stages 2-5 and substitute
$\alpha_2 \leftarrow \alpha_1$ in stages 7 and 8.
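The two invariance stages of HIA (the sum over each column followed by the Rapid transform along the column index $r$) can be sketched on a synthetic coefficient grid. This is an idealized illustration, not the authors' implementation: it assumes the second-layer codes are stored as an array of shape $(H_c, W_p, K_2)$, that $W_p$ is a power of two, and that a scaling or rotation of the input acts as an exact cyclic shift of that grid. The function names are hypothetical.

```python
import numpy as np

def rapid_along_axis0(u):
    """Rapid transform along axis 0 (length a power of two), applied per column."""
    v = np.asarray(u, dtype=float)
    m = v.shape[0]
    if m == 0 or m & (m - 1):
        raise ValueError("axis length must be a power of two")
    half = m // 2
    for _ in range(m.bit_length() - 1):
        nxt = np.empty_like(v)
        nxt[0::2] = v[:half] + v[half:]            # sums at odd (1-based) positions
        nxt[1::2] = np.abs(v[:half] - v[half:])    # absolute differences at even ones
        v = nxt
    return v

def invariant_features(alpha2):
    """alpha2 has shape (H_c, W_p, K_2): one K_2-vector per sub-column (s, r)."""
    summed = alpha2.sum(axis=0)           # stage 7: sum over s, shape (W_p, K_2)
    return rapid_along_axis0(summed)      # stage 8: Rapid along r, shape (W_p, K_2)

# Synthetic codes: H_c = 6 sub-columns per column, W_p = 8 columns, K_2 = 4 atoms.
rng = np.random.default_rng(0)
alpha2 = rng.integers(-5, 6, size=(6, 8, 4)).astype(float)
beta = invariant_features(alpha2)

# Shift by 2 along s (scaling) and by 3 along r (rotation); features are unchanged.
shifted = np.roll(alpha2, shift=(2, 3), axis=(0, 1))
print(np.allclose(invariant_features(shifted), beta))
```

In practice a scaling moves image content out of the log-polar frame rather than wrapping it around, and the sparse codes of shifted sub-columns are only approximately shifted copies, so the invariance obtained on real data is approximate.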

The re-sampled USPS [17] dataset contains 4649 training images
and 4649 testing images of size 16 × 16, which were centered in a
24 × 24 matrix. Following [13], the whole image served as a patch
in the STL implementation. For the HIA algorithm, the resolution
of the log-polar images was increased to 40 × 40, and patches of
10 × 10 with an overlap of 8 pixels were used. The invariant features
were grouped such that the SVM classifier was trained with 4649
vectors.

In the first experiment (Table 1), we trained both systems with
aligned digits and then classified the aligned testing set. As
expected, the STL algorithm performs a little better than HIA (first
row). This could be explained by the overhead and interpolations that
HIA incurs during the log-polar mapping. Next, we trained the
dictionaries with randomly rotated digits in an angle range of
[−50°, 50°]. The testing set was randomly rotated as well.
Classification results in this case were very close (second row), which
makes sense since STL learns the different possible angles. Lastly, we
added a scaling effect of ±20% of the digit size. Since the parameter
K was fixed in all cases, a better result was obtained for the HIA
algorithm (third row). The STL dictionary was not rich enough to
accommodate so many combinations of scales and rotations in the
learning set. The HIA dictionary, on the other hand, learned invariant
features and therefore performed better.

The next experiment is summarized in Table 2. In this case,
dictionaries were learned from aligned digits only, and testing images
were randomly rotated and scaled. Fig. 3 illustrates a few samples
from both sets. The superiority of the suggested algorithm is clear,
even when compared to the IA algorithm. The rotation range was
selected such that there would be no confusion between the digits 6
and 9. Experiments which exclude the digit 9 yield a classification
accuracy of 77.5% (versus 36.4% using STL) over the whole rotation
range of [0°, 360°].


Data set      Rotation [°]   Scale    STL     HIA
10 digits     0              1        95.8    93.7
              ±50            1        89.8    88.7
              ±50            ±0.2     83.8    88.0

Table 1. Classification accuracy for the digits data [%]. Both
training and testing samples are randomly transformed.


Data set      Rotation [°]   Scale    STL     HIA     IA
10 digits     ±50            1        59.1    86.0    76.9
              ±50            ±0.2     55.6    83.6    73.3
3 Textures    -              -        76.4    94.7    -
4 Textures    -              -        75.6    91.2    -

Table 2. Classification accuracy for the digits and texture data [%].
Only testing samples are randomly transformed for the digits (such
transformations are natural in the texture dataset).

In the second example we used a texture database [18] with high
variability of scaling and viewpoints within each class (Fig. 4). The
database contains 40 images of size 480 × 640 for every class. We
used 25 images for training and 15 images for testing. For the STL
algorithm, the optimal patch size was 20 × 20 with an overlap of 12
pixels. For the HIA algorithm, patches of 50 × 50 with an overlap of
42 pixels were used. Feature vectors in this case were not grouped,
and every sub-image was classified independently. The results are
presented in the bottom panel of Table 2. Once again, the robustness
of the algorithm is verified by its tolerance to significant
geometric transformations.


6.  CONCLUSIONS 

In this paper, we showed that a hierarchical approach to dictionary
learning, combined with a cortical (log-polar) transform, plays a
significant role in automatic invariant feature extraction. The
suggested algorithm demonstrated very promising results in the case of
transformed pattern classification. In future work we would like to
study additional transformations, such as affine ones, and also the
introduction of this framework into discriminative dictionary
learning [13].
Fig. 3. Samples from the training set (top) and transformed testing
set (bottom).


Fig.  4.  Samples  from  a  textures  database.  Every  row  represents  a 
different  class. 